DOCUMENT RESUME

ED 342 798                                  TM 017 963

AUTHOR        Moody, David
TITLE         Strategies for Statewide Student Assessment. Policy
              Briefs, Number 17.
INSTITUTION   Far West Lab. for Educational Research and
              Development, San Francisco, Calif.
SPONS AGENCY  Office of Educational Research and Improvement (ED),
              Washington, DC.
PUB DATE      91
CONTRACT      400-86-0009
NOTE          5p.
PUB TYPE      Reports - Evaluative/Feasibility (142)

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Achievement Tests; Basic Skills; *Educational
              Assessment; Educational Change; Elementary Secondary
              Education; Evaluation Methods; Holistic Evaluation;
              Psychometrics; Scoring; *Standardized Tests; *State
              Programs; *Student Evaluation; Testing Problems;
              Testing Programs; *Test Use
IDENTIFIERS   Alternatives to Standardized Testing; *Authentic
              Assessment; *Performance Based Evaluation



ABSTRACT 

Traditional standardized tests of basic skills are no 
longer considered meaningful by many leading authorities in 
educational measurement. Alternative approaches are not yet fully 
developed, although many efforts are being made. This paper explores
the issues surrounding student assessment in the context of existing 
and evolving state practices, which frequently combine high-stakes 
evaluations with traditional multiple-choice norm-referenced 
examinations and negatively affect instructional quality. A new
generation of alternative strategies for student evaluation is being
designed to measure student performance in situations that bear an 
authentic relationship to real-world tasks. Authentic assessment is 
criterion-referenced and performance-based. It has intrinsic validity 
and a holistic approach. Because authentic assessments are typically 
much more difficult to score than traditional tests, they are 
expensive. The psychometric foundations of authentic assessments are 
still not fully developed. Vermont, Michigan, and Kentucky are 
leaders in the effort to use authentic assessment for statewide
testing programs. The fact that many problems surround the changing 
nature of student assessment means that caution must be exercised in 
using assessment as a tool of educational reform. (SLD) 






FAR WEST LABORATORY






POLICY






1991 



BRIEFS 



NUMBER SEVENTEEN 




Strategies for 
Statewide Student Assessment 

David Moody 






Introduction 

Over the course of the last
decade, statewide assessments of
student achievement have assumed a
position of prominence in the
landscape of educational reform. Prior
to 1980, nearly half of all states
mandated no programs of this kind.
Today, statewide assessments are all
but ubiquitous, and the nature of the
debate has shifted away from whether
to conduct such programs to what
kind of program to conduct.

As this Policy Brief explains,
traditional, standardized,
multiple-choice tests of basic skills are
no longer considered meaningful by
many leading authorities in
educational measurement. Alternative
approaches, on the other hand, are
not yet fully developed, although
innovative efforts are proliferating.
The field of student assessment, in
short, is undergoing a profound
transformation at the very time that
the demand for achievement data is
growing ever more acute.

This Brief is designed to explore
these issues in the context of existing
and evolving statewide practices.
First, the essential nature of the
controversy associated with student
assessment is explained. Then the
variety of assessment strategies
currently visible on the statewide
level is reviewed. The Brief concludes
with some issues policymakers need
to consider as they mandate new
strategies for statewide student
assessment.

Far West Laboratory for Educational
Research and Development serves the
four-state region of Arizona,
California, Nevada, and Utah,
working with educators at all levels to
plan and carry out school
improvements. Part of our mission is
to help state department staff, district
superintendents, school principals,
and classroom teachers keep abreast of
the best current thinking and practice.

The Nature of the Controversy 

The assessment of student
achievement is a conflicted and
sensitive area of educational policy
and practice. The nature of the
controversy can be traced to the
multiple uses of testing data
throughout the educational system.

Student testing data has in the
past served two very different
purposes: to guide instructional
practice and to evaluate the
practitioners. Teachers use test results
for diagnostic purposes, to identify what
students know and what they need to
learn. State and local administrators
use test results for accountability
purposes, to identify strengths and
weaknesses among programs and
personnel.

Ordinarily, different kinds of
tests are used for these two purposes.
Teachers commonly devise their own
tests, trade with other teachers, or use
tests presented in classroom texts.
Administrators, by contrast, tend to
prefer commercially produced
instruments designed to compare the
performance of large numbers of
students (whole schools or school
districts) with national norms.

Problems arise when the school
performance measures carry
high-stakes rewards and sanctions.
Teachers then become acutely aware of
these consequences and tend to teach
to the test. This phenomenon is so
well known that practitioners have
coined an acronym for it: WYTIWYG
(what you test is what you get).
Teachers not only gear classroom
activities to the evaluative
assessments, but often also substitute
those for their own diagnostic measures.
At that point the distinction between
instructional and evaluative
assessment purposes begins to break down.

To tailor instruction to a
standardized test may at first glance seem
a desirable result; testing for essential
skills, for example, should (by the
logic of WYTIWYG) produce students
with essential skills. Upon closer
inspection, however, WYTIWYG
contains a deeper, more sinister
implication. It implies that the wrong
kind of test may have far-reaching,
negative effects on the quality of
classroom instruction. In particular, a
multiple-choice, "fill in the bubble"
type of examination may lead to
Trivial Pursuit-type instruction that
produces students who can memorize
well but are rarely challenged to
exercise "higher-order" thinking
skills: to think critically and deeply; to
apply knowledge in novel situations;
to integrate many discrete pieces of
information; and to collaborate with
others in the solution of complex
problems.






Combine these two factors,
high-stakes evaluative assessments
that end up driving instruction and a
testing instrument that reflects a
narrow subset of legitimate learning
objectives, and instructional quality
is likely to suffer seriously.
Unfortunately, statewide assessment
strategies commonly meet both of these
criteria.

State departments of education, for
example, typically use student test
scores to regulate flows of money to
schools, to establish standards for
graduation or for entrance into special
programs, and to identify schools and
districts in need of some form of
external intervention. Most states make
test results public, and the sheer
attention focused on the comparative
performance of schools and districts
serves as a powerful pressure on school
personnel. In addition to attaching
high-stakes consequences, many statewide
assessment strategies continue to rely on
traditional, norm-referenced,
multiple-choice exams.

Traditional and Authentic
Assessment

The advent of large-scale testing
programs coupled with rewards and
sanctions has brought a spotlight to bear
upon the deficiencies in traditional
testing practices. A new generation of
alternative strategies for student
assessment is being designed
specifically to redress these deficiencies.
The cornerstone of these new strategies is
authenticity: assessment is based on
students' performance in situations that
bear an authentic relationship to "real
world" tasks. Writing essays,
performing science experiments,
participating in collaborative activities:
these are now the substance of assessment
practice itself, rather than the distant
goal for which testing serves as a
convenient substitute.

A comparison of traditional with
authentic forms of assessment reveals
their respective strengths and
weaknesses. Strictly speaking, traditional
testing practices embrace a variety of
approaches to assessment. In the
following discussion, however, the
phrase traditional testing refers
specifically to standardized,
norm-referenced, multiple-choice exams of
the kind exemplified by the Iowa Test of
Basic Skills.

Authentic assessment differs 
from traditional testing in several 
respects. A first and fundamental 
difference pertains to the form in 
which the results are reported. Norm- 
referenced tests give only relative, 
comparative data: they compare the 
performance of a given student with 
the norm established by the perfor- 
mance of his or her peers. Such 
results are commonly expressed as 
percentiles — the percentage of 
students who scored better or worse 
than the student in question. 

Authentic assessment, by contrast,
expresses results with reference to
actual, concrete skills that a student has
or has not mastered. Such tests are said
to be criterion-referenced. A
criterion-referenced test will tell whether
Jason or Jennifer can or cannot add or
subtract fractions; a norm-referenced
test will tell only whether they are
doing better or worse than other
students of their grade level or age.
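
The contrast between the two reporting
forms can be made concrete with a small
sketch. The Python below is illustrative
only and is not drawn from any state's
scoring system: the peer scores, the item
results, the 80 percent mastery cutoff,
and the function names are all
hypothetical. It reports the same student
once as a percentile rank and once as a
skill-by-skill mastery list.

```python
from bisect import bisect_left


def percentile_rank(score, peer_scores):
    """Norm-referenced view: percentage of peers scoring strictly below `score`."""
    ranked = sorted(peer_scores)
    below = bisect_left(ranked, score)  # count of peer scores lower than `score`
    return 100.0 * below / len(ranked)


def mastery_report(item_results, cutoff=0.8):
    """Criterion-referenced view: for each skill, whether the share of
    correct items reaches the (assumed) mastery cutoff."""
    return {
        skill: sum(answers) / len(answers) >= cutoff
        for skill, answers in item_results.items()
    }


# Hypothetical data: total scores for ten peers and one student's item results.
peers = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
jennifer_total = 27
jennifer_items = {
    "add fractions": [True, True, True, True, True],
    "subtract fractions": [True, False, False, True, False],
}

norm_view = percentile_rank(jennifer_total, peers)
criterion_view = mastery_report(jennifer_items)

print(f"Scored above {norm_view:.0f}% of peers")
for skill, mastered in criterion_view.items():
    print(f"{skill}: {'mastered' if mastered else 'not yet mastered'}")
```

With these made-up numbers, the
norm-referenced view says only that
Jennifer outscored 60 percent of her
peers, while the criterion-referenced
view shows that she adds fractions
reliably but has not yet mastered
subtracting them.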

In order to qualify as authentic
assessment, however,
criterion-referenced results represent only a
first step. Another characteristic of
authenticity pertains to the level and
complexity of the skills under
examination. A paper and pencil test of
musical ability might reveal a certain,
narrow subset of proficiencies:
whether the student can correctly
name the notes, for example. The
actual performance of a composition
on the piano, by contrast, calls into
play technical, kinesthetic, and
aesthetic abilities, in addition to the
ability to decipher musical notation.

Authentic assessment, therefore,
is both criterion-referenced and
performance-based. In addition, it
characteristically displays a quality of
immediacy, vitality, and meaning to
the test-taker: it has intrinsic validity.
In most forms of traditional testing,
validity is a laboriously achieved
product of finding isolated, artificial
tasks that correlate reasonably well
with the desired skill. In authentic
assessment, the test is the
demonstration of the desired skill itself.

Another central feature of
authentic assessment is that it is
holistic. A traditional test of writing
skills by necessity isolates a large
number of discrete bits of knowledge.
Direct writing assessment, however,
requires the student actually to
produce an essay, in which the
countless bits of knowledge are
blended into a single whole.

In part because they are holistic,
authentic assessments are typically
far more difficult to score than are
traditional tests. Indeed, the hallmark
and principal virtue of the
standardized, multiple-choice exam is ease
of scoring: most are scored with great
rapidity, little cost, and complete
reliability by machine.

In order to score an authentic
assessment, however, a team of
human observers is generally
required. Just as panels of experts score
the performances of Olympic
gymnasts, a collection of teachers is
ideally available to evaluate a
student's ability to conduct a debate,
deliver a speech, or investigate a
scientific question. Collective
observation and evaluation of this kind
require extensive preparation in order
to coordinate scoring procedures and
criteria.
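
One simple check that such preparation
has worked is to have two scorers rate
the same performances independently and
count how often they agree. The sketch
below is an illustration under stated
assumptions, not a procedure described in
this Brief: the four-point rubric, the
scores, and the function names are
hypothetical.

```python
def exact_agreement(rater_a, rater_b):
    """Fraction of performances given the identical rubric score by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of performances")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def adjacent_agreement(rater_a, rater_b, tolerance=1):
    """Fraction of performances where the two scores differ by at most `tolerance`."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of performances")
    close = sum(abs(a - b) <= tolerance for a, b in zip(rater_a, rater_b))
    return close / len(rater_a)


# Hypothetical scores (1 = weakest, 4 = strongest) from two teachers
# rating the same eight student speeches.
teacher_1 = [4, 3, 2, 4, 1, 3, 2, 4]
teacher_2 = [4, 3, 3, 4, 1, 2, 2, 2]

print(f"Exact agreement:    {exact_agreement(teacher_1, teacher_2):.3f}")
print(f"Adjacent agreement: {adjacent_agreement(teacher_1, teacher_2):.3f}")
```

A low exact-agreement figure would
suggest that the scorers are reading the
rubric differently and need further joint
preparation before their scores are
reported.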

While the art or science of
authentic assessment is no longer in
its infancy, neither is it a mature and
stable discipline. Its basic weaknesses
are its cost and the incompleteness of
its psychometric foundations. Yet
authentic assessments evaluate a
much fuller range of student abilities
than is possible with a multiple-choice
exam, and the testing process
makes much more sense to students
and teachers. Authentic assessment
tasks are designed to be complex,
integrated, and challenging; in this
way, they mirror and support good
instruction. Due to this combination
of strengths and weaknesses,
traditional and authentic forms of
assessment will probably continue to
coexist for some time to come.

The State of State Assessment 

As of fall, 1991, 44 states and the 
District of Columbia had some form 
of mandated statewide testing 
program in place; of the remaining six 
states, all but Nebraska had plans in 
progress. Reading, mathematics, and 
language arts were each tested in 40 
states or more. More than half of all 
states prescribed testing in writing, 
science, and social studies.

The most common usage of 
statewide assessments is to monitor 
the progress of individual students. 
For example, results may function as 
a criterion for grade promotion. 
Seventeen states require students to
pass a statewide minimum competency
exam in order to graduate;
California, Delaware, and Virginia
require school districts to select and
implement such exams. Arkansas and
Virginia employ tests as an entrance
requirement for high school.

The next most common usage of 
statewide test results is for school-site 
accountability, both to state-level
authorities and to the public at large.
Results are often published, for 
example, in comprehensive reports of 
educational progress among schools 
and districts throughout the state. 

Finally, in 20 states, funding 
decisions hinge upon results of 
student assessments. In some states, 
low-achieving schools receive added 
funds, while in others extra money is
a reward to schools whose scores are 
high. 

The majority of states continue to 
employ nationally normed,
multiple-choice exams, such as the Iowa
Test of Basic Skills, the Stanford
Achievement Test, and the Comprehensive
Test of Basic Skills. Because they are
norm-referenced, these tests allow 
states to compare their students' 
performance with that of students 
throughout the nation. 






Criterion-referenced tests — 
those whose outcomes are stated with 
reference to specific skills — are used 
in combination with norm-referenced 
exams in 25 states and are relied upon 
exclusively in 10 others. Such tests are 
most commonly employed in states 
that have articulated specific learning 
objectives for various ages or grade
levels. Criterion-referenced results tell
how many students have met those
particular learning goals.

Among assessments that go 
beyond the multiple-choice format, 
writing samples are the most
widespread, occurring now in some 28
states. A dozen states are moving
toward performance-based
assessment of mathematics achievement,
and eight use such approaches for 
secondary-level science. 

Pioneers: Vermont, Michigan and 
Kentucky 

Several states, including Ver- 
mont, Michigan, and Kentucky, are in 
the process of reconstructing their 
strategies for student assessment. 
Vermont is noteworthy for the extent 
of its commitment to authentic 
assessment; Michigan for its leader- 
ship in devising low-cost, perfor- 
mance-based testing programs; and 
Kentucky for its radical revision of 
the educational system as a whole. 

Vermont has completed the first 
year of a pilot program of statewide 
student portfolios in writing and 
math. The state allowed participating 
teachers to use their own judgment in 
selecting students' best and most 
characteristic pieces of classroom 
work for inclusion in portfolios. This 
process worked reasonably well in 
the assessment of student writing; but
scorers of the math portfolios discov- 
ered that some forty percent of the 
submissions consisted of worksheets. 
Specification of more lifelike, perfor- 
mance-based math activities is high 
on the list of changes for this year's 
full-scale implementation of portfolio 
assessments in grades 4 and 11.






The challenges Vermont faced in
shifting to authentic assessment were
both greater than and different from
those anticipated. Scoring, for
example, proved to be surprisingly
straightforward, although the math
teachers preferred a quantitative
scale, while the English teachers
insisted on verbal descriptors.
Nevertheless, all the participants
continue to endorse the process.
Vermont leads the nation in another
respect as well: its assessment
strategy is deliberately designed not
to include high-stakes consequences.

Michigan is a pioneer in finding
ways to undertake performance-based
assessments with the minimum
possible cost. The state has developed
a battery of assessment programs on
an incremental basis, adding one or
two in different subject areas each
year since 1986. To date, performance
assessments have been developed in
art, music, math, science, social
studies, and physical education.

The key to carrying out such a
program on a low-cost basis is to find
people who strongly desire to
participate, and to enlist their assistance
at every stage of the process:
development, administration, and
interpretation. In Michigan,
administration of the assessment
procedures and interpretation of results
are undertaken by graduate students in
each region of the state, as well as by
preservice and inservice personnel.

A courtroom challenge to
Kentucky's formula for financing
school districts had an unexpected
outcome: the State Supreme Court
ruled that the educational system as a
whole, "in all its parts and parcels,"
was constitutionally invalid. The 1990
legislature faced the daunting task of
mandating a new code for education
from the ground up. In so doing, it
wrote into law a dual system of
statewide assessment strategies: a
nationally normed, traditional test of
academic skills was coupled with an
extensive, to-be-developed,
performance-based system.






What distinguishes Kentucky's
strategy, however, is not this dual
approach, but rather the extremely
high stakes attached to site-level
outcomes. The law mandates precise
levels of improvement in student
percentiles that trigger release of
supplementary funds. Schools whose
performance declines by set amounts
incur an avalanche of penalties:
parents are notified that their children
are now free to transfer; all certified
staff are automatically placed on
probation; and a Kentucky
"distinguished educator" visits the site to
determine which employees, from
teachers through district
superintendents, shall continue to hold
their jobs.

The bifurcated nature of the
assessment field today is reflected in a
pair of developments at the national
level. On one hand, the National
Assessment of Educational Progress
has recently entered upon the
state-by-state assessment of student
achievement, using a traditional testing
format. It is now possible to compare
the math achievement of
eighth-graders in Colorado with that in
Maryland or Hawaii on a common
scale.

On the other hand, in a parallel 
initiative from the federal level, the 
Office of Educational Research and 
Improvement has funded a new center 
(under the auspices of UCLA's Center 
for Research on Evaluation, Standards, 
and Student Testing) designed to help 
states exchange information about 
authentic, performance-based
assessments, such as the portfolio approach
under development in Vermont. 

Issues to Consider 

The diversity of mandated student 
assessment programs across the 50
states reflects competing viewpoints 
about appropriate ways to test stu- 
dents. As policymakers re-think 
student assessment systems, they need 
to consider a variety of issues: 

• Testing programs are not an
independent element in the
educational system; on the contrary, they
interact with actual instruction in
significant and often unexpected
ways. These must be anticipated
and monitored with care.

• State demands for accountability,
or "top-down" reform, could
directly impede a major
"bottom-up" reform strategy: the
restoration of authority to teachers and
principals at the school site.
Top-down regulation of the system
may inadvertently contribute to
passivity and burnout among
those charged with the actual
delivery of educational services.

• The costs of state-mandated testing
must be considered. What fraction
of the total educational budget
does student assessment warrant?
In an era of scarce resources, this
decision must be weighed against
alternative educational needs such
as funding for restructuring,
professional development, or
curricular innovations.

• State-level support for assessment 
research should be a corollary of 
the state-level demand for educa- 
tional accountability. As the 
demand for assessment data 
continues to grow, policymakers
must recognize that the field of
student testing is undergoing 
fundamental change. In order to 
ensure an effective transition, more 
research into the statistical under- 
pinnings and the instructional 
consequences of alternative
assessment strategies is needed. 

The assessment of student 
academic achievement is evidently 
not as straightforward a matter as the 
evaluation of athletic accomplish- 
ment, for example. Learning is not as 
susceptible to objective observation 
and measurement as is physical
prowess, and high-stakes educational
testing may interact with the very
system it is attempting to measure.
The prudent policymaker would do
well to tread gently in this sensitive
territory, and to exercise caution in 
seeking to use assessment as a 
primary tool of reform. 



Briefs is published by the Far West
Laboratory for Educational
Research and Development. The
publication is supported by federal
funds from the U.S. Department of
Education, Office of Educational
Research and Improvement,
contract no. 400-86-0009. The contents
of this publication do not necessarily
reflect the views or policies of
the Department of Education, nor
does mention of trade names,
commercial products, or organizations
imply endorsement by the United
States Government. Reprint rights
are granted with proper credit.

Staff 

Mary Amsler 

Director, Policy Support 
Services 

David Moody 
Senior Research Associate 

James N. Johnson 

Communications Director 

Fredrika Baer 

Administrative Assistant 



Briefs welcomes your comments 
and suggestions. For additional 
information, please contact: 

James N. Johnson 

Far West Laboratory 
730 Harrison Street 
San Francisco, CA 94107
(415) 565-3000 






